Introduction

The distributions of annual peak flows at USGS streamgages are used estimate flood frequency-magnitude characteristics at gaged sites by use of the log-Peason type-III distribution. This distribution is described by measures of the mean, variance, and skewness of the logarithm of annual peak flows. Estimates of high magnitude floods are particularly sensitive to the skewness of the distribution, although skewness estimates based on data from an individual streamgage are highly variable. Based on guidelines provided in Bulletin 17B and 17C, a regional estimate of flood skew is combined with an at-site (station skew) estimate to help stabilize the estimate of flood frequency characteristics.

A Bayesian generalized least squares (B-GLS) model is the standard approach for estimating the regional skew by regression on basin and climatic characteristics. This approach accounts for the length of record and spatial correlations among contemporaneous annual peaks in a network of streamgages by appropriately adjusting model parameters and uncertainty estimates. Despite the capabilities of the underlying model, identifying individual basin or climatic characteristics that are statistically associated with flood skew is problematic. Therefore, a constant is commonly used to estimate regional skew under the B-GLS model.

It may be the case, however, that subtle spatial variations in regional skew are caused by a combination of factors that are not linearly associated with available basin or climatic characteristics. Such effects may result in a persistent spatial pattern in the residuals of the skew estimates. To assess this possibility, a generalized additive model (GAM) is applied to the skew residuals from a B-GLS analysis of two, two-digit hydrologic regions, which includes US Great Lakes (04) and the Ohio River (05) basins. The GAM model provides a flexible basis of smoothing functions to accommodate irregular variations in covariates, such as spatial coordinates of gaged basin centroids. Linear components also are accommodated in a GAM model.

Computed station skew statistics from 368 streamgages in hydrologic unit regions 04 and 05 with 35 or more years of annual peak flow data were used in this analysis. Station skews were computed by use of methods described in Bulletin 17C. A preliminary analysis of the data describes the spatial distribution of gaged basin centroids, and analyzes the distribution of station skew values for outliers. A GAM model was fit to the skew values using the eastings and northing of basin centroids to provide an initial assessment of the potential utility of the GAM model. Within the GAM analysis, station-skews for individual streamgages will be weighted by their record length. This preliminary assessment was followed by repeated, random partitioning of the data set into training and testing subsets containing 80- and 20-percent of the full data set, respectively. The testing data set was used to determine whether the GAM model outperformed the constant estimated by use of the B-GLS model.

Initialize Environment

library(tidyverse)
-- Attaching packages --------------------------------------- tidyverse 1.2.0 --
v ggplot2 2.2.1     v purrr   0.2.4
v tibble  1.3.4     v dplyr   0.7.4
v tidyr   0.7.2     v stringr 1.2.0
v readr   1.1.1     v forcats 0.2.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
library(knitr)

Read in Skew Data for HUC Regions 04 and 05

df <- read_tsv(file = 'C:/Home/SW_Specialist/Skew0405/data/skew_data.txt') %>% 
   mutate(Station = paste0('0',USGS),
          Resid_skew_sign = factor(sign(Residuals))) %>% 
   rename(Record_len      = Pseudo_Length,
          Station_skew    = Station_Skew,
          Residual_skew   = Residuals) 
Parsed with column specification:
cols(
  Index = col_integer(),
  USGS = col_integer(),
  Pseudo_Length = col_integer(),
  Station_Skew = col_double(),
  log10_DA = col_double(),
  Lat_Cntrd = col_double(),
  Long_Cntrd = col_double(),
  Residuals = col_double()
)

List Subset of Data and Summarize

kable(df[1:20,c('Station','Station_skew','Residual_skew','Lat_Cntrd','Long_Cntrd')],
      caption = 'Sample of Skew Residual Data')
Station Station_skew Residual_skew Lat_Cntrd Long_Cntrd
03336645 0.419 0.332 40.39385 -88.00935
03336900 0.000 -0.086 40.25089 -88.07321
03343400 -0.736 -0.822 39.92401 -88.14127
03344500 0.102 0.016 39.34818 -88.00714
03345500 -0.827 -0.913 39.51656 -88.13700
03346000 -0.304 -0.390 39.26853 -87.91700
03378000 -0.456 -0.542 38.53839 -87.94023
03378635 0.334 0.248 39.30351 -88.52165
03379500 -0.599 -0.686 38.99769 -88.48511
03380500 -0.037 -0.123 38.56331 -88.72489
03381500 0.239 0.153 38.64497 -88.44497
03382100 1.739 1.653 37.61436 -88.85263
03384450 -0.355 -0.441 37.53137 -88.52718
03385000 -0.231 -0.317 37.47848 -88.61897
03274650 -1.257 -1.343 40.04106 -85.10903
03275000 -0.360 -0.446 39.84952 -85.09757
03275600 -0.210 -0.297 39.86655 -84.84453
03276700 0.518 0.431 39.08198 -85.11773
03277000 0.076 -0.010 39.13191 -85.21422
03291780 -0.639 -0.725 38.93353 -85.28100

Plot Distribution of Streamgage Centrois

Note: symbols discretize residuals into positive and negative values

us_map %>% 
   filter(region %in% c('michigan', 'ohio', 'indiana', 'wisconsin', 'illinois',
                        'pennsylvania','new york', 'kentucky', 'west virginia',
                        'vermont')) %>% 
   ggplot( aes(x = long, y = lat, group = group)) + 
   geom_polygon(fill = 'tan', color = 'black') + 
   geom_polygon(data = huc2_0405, aes(x = long, y = lat, group = group),
                fill = NA, color = 'blue') +
   geom_point(  data = df, aes( x = Long_Cntrd, y = Lat_Cntrd, group = NULL), 
                color = 'red') + 
   coord_map('conic', lat0 = 42)
Regions defined for each Polygons


Project Latitudes and Longitudes of Basin Centroids

library(rgdal)
library(sp)
df_prj  <- df
coordinates(df_prj) <- c('Long_Cntrd', 'Lat_Cntrd')
class(df_prj)
[1] "SpatialPointsDataFrame"
attr(,"package")
[1] "sp"
proj4string(df_prj) <- "+proj=longlat +datum=NAD83"
# Same transform as 
#  EPSG:102003 USA_Contiguous_Albers_Equal_Area_Conic'
proj_sel <-  'EPSG:102004 USA_Contiguous_Lambert_Conformal_Conic'
#  EPSG:102005 USA_Contiguous_Equidistant_Conic
df_prj <- spTransform(df_prj, CRS = CRS("+init=esri:102004"))
easting  <- attributes(df_prj)$coords[,1]
northing <- attributes(df_prj)$coords[,2]
plot(easting, northing, pch = 16, col = 'blue')

east_std <- (easting  - mean(easting ))/100000
nrth_std <- (northing - mean(northing))/100000
plot(east_std, nrth_std, pch = 16, col = 'green4',
     main = paste('Regional Skew Streamgages',proj_sel),
     xlab = 'Standardized Easting', ylab = 'Standardized Northing')

df %>%
   ggplot( aes( x = Station_skew)) +
   geom_histogram() + 
   geom_vline( xintercept = 0, color = 'red', linetype = 'dashed')

ndxOut <- which(df$Station_skew > 2)
# Remove outlier
df <- df[-ndxOut,]
df %>%
   ggplot( aes( x = Station_skew)) +
   geom_freqpoly( aes(y = ..density..), color = 'salmon', size = 1.2) + 
   geom_vline( xintercept = 0, color = 'red', linetype = 'dashed') +
   stat_function(fun = dnorm, args = list(mean = mean(df$Station_skew),
                                          sd   =   sd(df$Station_skew)),
                 color = 'blue') +
   theme_few()

library(mgcv)
df$east <- east_std[-ndxOut]
df$nrth <- nrth_std[-ndxOut]
gam1 <- gam(Station_skew ~ s(east, nrth), weights = Record_len, data = df)
summary(gam1)

Family: gaussian 
Link function: identity 

Formula:
Station_skew ~ s(east, nrth)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  0.07062    0.02185   3.231  0.00135 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
               edf Ref.df     F  p-value    
s(east,nrth) 24.42  27.72 3.847 7.74e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.215   Deviance explained = 26.8%
GCV = 12.151  Scale est. = 11.307    n = 366
plot(gam1)

df$gam_resid <- gam1$residuals
df %>% 
   ggplot( aes(x = gam_resid) ) +
   geom_density(colour = 'red', size = 1.5) +
   geom_density( aes(x = Residual_skew), color = 'blue', size = 1.5) +
   theme_few() 

us_map %>% 
   filter(region %in% c('michigan', 'ohio', 'indiana', 'wisconsin', 'illinois',
                        'pennsylvania','new york', 'kentucky', 'west virginia',
                        'vermont')) %>% 
   ggplot( aes(x = long, y = lat, group = group)) + 
   geom_polygon(fill = 'tan', color = 'black') + 
   coord_map('conic', lat0 = 42) +
   # theme_void() + # This eliminates lat/lon marks 
   # theme_few() +
   geom_point(data = df, aes( x=Long_Cntrd, y = Lat_Cntrd, 
                              colour = gam_resid, group = NULL)) +
   scale_colour_gradient2(low  = "red" , mid = "white",
                          high = "blue" )

Skew Contour Map

A skew contour map is developed that is based on the gam1 model above.

plot_ly( x = c(xvec), y = c(yvec),
   z = matrix(zvec,40,40, byrow = TRUE), type = 'contour', 
   contours = list(start = -0.5, end = 0.7, size = 0.2, showlabels = TRUE)) %>% 
   add_trace(type = 'scatter', mode = 'markers', x = df$east, y = df$nrth, 
               color = 'red', hovertext = paste(df$Station, df$Station_skew)) 
'scatter' objects don't have these attributes: 'z', 'contours'
Valid attributes include:
'type', 'visible', 'showlegend', 'legendgroup', 'opacity', 'name', 'uid', 'ids', 'customdata', 'hoverinfo', 'hoverlabel', 'stream', 'x', 'x0', 'dx', 'y', 'y0', 'dy', 'text', 'hovertext', 'mode', 'hoveron', 'line', 'connectgaps', 'cliponaxis', 'fill', 'fillcolor', 'marker', 'textposition', 'textfont', 'r', 't', 'error_y', 'error_x', 'xaxis', 'yaxis', 'xcalendar', 'ycalendar', 'idssrc', 'customdatasrc', 'hoverinfosrc', 'xsrc', 'ysrc', 'textsrc', 'hovertextsrc', 'textpositionsrc', 'rsrc', 'tsrc', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule'
'scatter' objects don't have these attributes: 'z', 'contours'
Valid attributes include:
'type', 'visible', 'showlegend', 'legendgroup', 'opacity', 'name', 'uid', 'ids', 'customdata', 'hoverinfo', 'hoverlabel', 'stream', 'x', 'x0', 'dx', 'y', 'y0', 'dy', 'text', 'hovertext', 'mode', 'hoveron', 'line', 'connectgaps', 'cliponaxis', 'fill', 'fillcolor', 'marker', 'textposition', 'textfont', 'r', 't', 'error_y', 'error_x', 'xaxis', 'yaxis', 'xcalendar', 'ycalendar', 'idssrc', 'customdatasrc', 'hoverinfosrc', 'xsrc', 'ysrc', 'textsrc', 'hovertextsrc', 'textpositionsrc', 'rsrc', 'tsrc', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule'
---
title: "Preliminary Analysis of Skew Residuals in HUCs 04 and 05"
author: Dave Holtschlag
output: html_notebook
---

## Introduction

The distributions of annual peak flows at USGS streamgages are used estimate flood frequency-magnitude characteristics at gaged sites by use of the log-Peason type-III distribution.  This distribution is described by measures of the mean, variance, and skewness of the logarithm of annual peak flows.  Estimates of high magnitude floods are particularly sensitive to the skewness of the distribution, although skewness estimates based on data from an individual streamgage are highly variable. Based on guidelines provided in Bulletin 17B and 17C, a regional estimate of flood skew is combined with an at-site (station skew) estimate to help stabilize the estimate of flood frequency characteristics.  

A Bayesian generalized least squares (B-GLS) model is the standard approach for estimating the regional skew by regression on basin and climatic characteristics.  This approach accounts for the length of record and spatial correlations among contemporaneous annual peaks in a network of streamgages by appropriately adjusting model parameters and uncertainty estimates. Despite the capabilities of the underlying model, identifying individual basin or climatic characteristics that are statistically associated with flood skew is problematic.  Therefore, a constant is commonly used to estimate regional skew under the B-GLS model.  

It may be the case, however, that subtle spatial variations in regional skew are caused by a combination of factors that are not linearly associated with available basin or climatic characteristics. Such effects may result in a persistent spatial pattern in the residuals of the skew estimates.  To assess this possibility, a generalized additive model (GAM) is applied to the skew residuals from a B-GLS analysis of two, two-digit hydrologic regions, which includes US Great Lakes (04) and the Ohio River (05) basins. The GAM model provides a flexible basis of smoothing functions to accommodate irregular variations in covariates, such as spatial coordinates of gaged basin centroids.  Linear components also are accommodated in a GAM model.

Computed station skew statistics from 368 streamgages in hydrologic unit regions 04 and 05 with 35 or more years of annual peak flow data were used in this analysis.  Station skews were computed by use of methods described in Bulletin 17C.  A preliminary analysis of the data describes the spatial distribution of gaged basin centroids, and analyzes the distribution of station skew values for outliers.  A GAM model was fit to the skew values using the eastings and northing of basin centroids to provide an initial assessment of the potential utility of the GAM model. Within the GAM analysis, station-skews for individual streamgages will be weighted by their record length. This preliminary assessment was followed by repeated, random partitioning of the data set into training and testing subsets containing 80- and 20-percent of the full data set, respectively.  The testing data set was used to determine whether the GAM model outperformed the constant estimated by use of the B-GLS model. 

## Initialize Environment

```{r setup}
library(tidyverse)
library(ggplot2)
library(ggthemes)
library(knitr)
```
***

## Read in Skew Data for HUC Regions 04 and 05

```{r read_data}
df <- read_tsv(file = 'C:/Home/SW_Specialist/Skew0405/data/skew_data.txt') %>% 
   mutate(Station = paste0('0',USGS),
          Resid_skew_sign = factor(sign(Residuals))) %>% 
   rename(Record_len      = Pseudo_Length,
          Station_skew    = Station_Skew,
          Residual_skew   = Residuals) 

```

## List Subset of Data and Summarize

```{r list_data, fig.cap = 'Table showing contents of skew data set'}

kable(df[1:20,c('Station','Station_skew','Residual_skew','Lat_Cntrd','Long_Cntrd')],
      caption = 'Sample of Skew Residual Data')

# Print summary of 
summary(df)

```

## Plot Distribution of Streamgage Centrois

Note: symbols discretize residuals into positive and negative values

```{r convert_coord}
library(rgdal)
library(sp)

path <- 'C:/Home/SW_Specialist/Skew0405/GIS/WBD_05_HU2_Shape/Shape' 
huc2_ohio_river  <- readOGR(dsn = path, layer = "WBDHU2")

path <- 'C:/Home/SW_Specialist/Skew0405/GIS/WBD_04_HU2_Shape/Shape'
huc2_great_lakes <- readOGR(dsn = path, layer = "WBDHU2") 

huc2_0405        <- rbind(huc2_great_lakes, huc2_ohio_river)

# How to extract dataframe from spatial polygon
spp <- SpatialPolygonsDataFrame(huc2_0405, data = as.data.frame(HUC2))
dfx <- data.frame(id = getSpPPolygonsIDSlots(huc2_0405))
row.names(dfx) <- getSpPPolygonsIDSlots(huc2_0405)
spdf <- SpatialPolygonsDataFrame(huc2_0405, data =dfx)

huc2_0405 %>% 
   ggplot( aes(x = long, y = lat, group = group)) +
   geom_polygon(fill = colors()[18], color = "black") +
   geom_point(df, aes(x = Long_Cntrd, y = Lat_Cntrd))


   geom_polygon(huc2_ohio_river, aes(x = long, y = lat, group = group),
                fill = colors()[15], color = 'red')


fill = colors()[18], 
+ 
   
   geom_polygon(data = states_ohio_river, 
                aes(x = long, y = lat, group = group),
                fill = NA, color = 'grey') +
   geom_point(data = skew_sites, 
              aes( x = LONG_CENT, y = LAT_CENT, group = NULL),
              color = 'red', 
              size  = 0.5) +
   coord_map() +
   theme_few() + xlab('Longitude') + ylab('Latitude') 




us_map <- map_data('state')

us_map %>% 
   filter(region %in% c('michigan', 'ohio', 'indiana', 'wisconsin', 'illinois',
                        'pennsylvania','new york', 'kentucky', 'west virginia',
                        'vermont')) %>% 
   ggplot( aes(x = long, y = lat, group = group)) + 
   geom_polygon(fill = 'tan', color = 'black') + 
   geom_polygon(data = huc2_0405, aes(x = long, y = lat, group = group),
                fill = NA, color = 'blue') +
   geom_point(  data = df, aes( x = Long_Cntrd, y = Lat_Cntrd, group = NULL), 
                color = 'red') + 
   coord_map('conic', lat0 = 42) +
   # theme_void() + # This eliminates lat/lon marks 
   theme_few() +
   geom_point(data = df, aes( x=Long_Cntrd, y = Lat_Cntrd, 
                              colour = Resid_skew_sign, group = NULL)) +
   scale_colour_manual( values = c('red', 'blue'))


```
***
## Project Latitudes and Longitudes of Basin Centroids

```{r proj_lat_long}
library(rgdal)
library(sp)

df_prj  <- df

coordinates(df_prj) <- c('Long_Cntrd', 'Lat_Cntrd')
class(df_prj)

proj4string(df_prj) <- "+proj=longlat +datum=NAD83"

# Same transform as 
#  EPSG:102003 USA_Contiguous_Albers_Equal_Area_Conic'
proj_sel <-  'EPSG:102004 USA_Contiguous_Lambert_Conformal_Conic'
#  EPSG:102005 USA_Contiguous_Equidistant_Conic
df_prj <- spTransform(df_prj, CRS = CRS("+init=esri:102004"))

easting  <- attributes(df_prj)$coords[,1]
northing <- attributes(df_prj)$coords[,2]


plot(easting, northing, pch = 16, col = 'blue')

east_std <- (easting  - mean(easting ))/100000
nrth_std <- (northing - mean(northing))/100000

plot(east_std, nrth_std, pch = 16, col = 'green4',
     main = paste('Regional Skew Streamgages',proj_sel),
     xlab = 'Standardized Easting', ylab = 'Standardized Northing')

```
```{r skew_dist}

df %>%
   ggplot( aes( x = Station_skew)) +
   geom_histogram() + 
   geom_vline( xintercept = 0, color = 'red', linetype = 'dashed')

ndxOut <- which(df$Station_skew > 2)

# Remove outlier

df <- df[-ndxOut,]


df %>%
   ggplot( aes( x = Station_skew)) +
   geom_freqpoly( aes(y = ..density..), color = 'salmon', size = 1.2) + 
   geom_vline( xintercept = 0, color = 'red', linetype = 'dashed') +
   stat_function(fun = dnorm, args = list(mean = mean(df$Station_skew),
                                          sd   =   sd(df$Station_skew)),
                 color = 'blue') +
   theme_few()

```


```{r gam_anal}

library(mgcv)

df$east <- east_std[-ndxOut]
df$nrth <- nrth_std[-ndxOut]


gam1 <- gam(Station_skew ~ s(east, nrth), weights = Record_len, data = df)

summary(gam1)

plot(gam1)

df$gam_resid <- gam1$residuals

```



```{r resid_plot}

df %>% 
   ggplot( aes(x = gam_resid) ) +
   geom_density(colour = 'red', size = 1.5) +
   geom_density( aes(x = Residual_skew), color = 'blue', size = 1.5) +
   theme_few() 

us_map %>% 
   filter(region %in% c('michigan', 'ohio', 'indiana', 'wisconsin', 'illinois',
                        'pennsylvania','new york', 'kentucky', 'west virginia',
                        'vermont')) %>% 
   ggplot( aes(x = long, y = lat, group = group)) + 
   geom_polygon(fill = 'tan', color = 'black') + 
   coord_map('conic', lat0 = 42) +
   # theme_void() + # This eliminates lat/lon marks 
   # theme_few() +
   geom_point(data = df, aes( x=Long_Cntrd, y = Lat_Cntrd, 
                              colour = gam_resid, group = NULL)) +
   scale_colour_gradient2(low  = "red" , mid = "white",
                          high = "blue" )


```

## Skew Contour Map

A skew contour map is developed that is based on the gam1 model above.  

```{r contour_info}

tmp  <- plot(gam1)


xvec <- tmp[[1]]$x
yvec <- tmp[[1]]$y
zvec <- tmp[[1]]$fit

plot_ly( x = c(xvec), y = c(yvec),
   z = matrix(zvec,40,40, byrow = TRUE), type = 'contour', 
   contours = list(start = -0.5, end = 0.7, size = 0.2, showlabels = TRUE)) %>% 
   add_trace(type = 'scatter', mode = 'markers', x = df$east, y = df$nrth, 
               color = 'red', hovertext = paste(df$Station, df$Station_skew)) 




```